Rectangular data
- Spreadsheet and statistical-software file formats (.xls, .sav, STATA: .dat, etc.)
Non-rectangular data
- Hierarchical data: XML and JSON (useful for complex/high-dimensional data sets)
- HTML (a markup language to define the structure and layout of webpages)
- Time series data
- Unstructured text data
- Images/pictures data
father  mother  name      age  gender
                John      33   male
                Julia     32   female
John    Julia   Jack      6    male
John    Julia   Jill      4    female
John    Julia   John jnr  2    male
                David     45   male
                Debbie    42   female
David   Debbie  Donald    16   male
David   Debbie  Dianne    12   female
dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087
13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766
12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089
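Because each line of the CSV above is one observation and each comma-separated field one variable, such data maps directly onto a data frame. The following is a minimal sketch in base R; the object names `csv_text` and `covid` are chosen here for illustration only, and the text is pasted inline rather than read from a file.

```r
# CSV text copied from the example above (header plus three rows)
csv_text <- "dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087
13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766
12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089"

# parse the text into a data frame: one row per line, one column per field
covid <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# inspect the rectangular shape (rows x columns)
dim(covid)

# columns can be used directly as variables
sum(covid$cases)
```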
<records>
  <record>
    <dateRep>14/10/2020</dateRep>
    <day>14</day>
    <month>10</month>
    <year>2020</year>
    <cases>66</cases>
    <deaths>0</deaths>
    <countriesAndTerritories>Afghanistan</countriesAndTerritories>
    <geoId>AF</geoId>
    <countryterritoryCode>AFG</countryterritoryCode>
    <popData2019>38041757</popData2019>
    <continentExp>Asia</continentExp>
    <Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>1.94523087</Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>
  </record>
  <record>
    <dateRep>13/10/2020</dateRep>
    ...
</records>
The actual content we know from the csv-type example above is nested between the ‘records’-tags:
<records> ... </records>
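To illustrate how the tag names inside a record play the role of the CSV column names, here is a small sketch using the xml2 package. It assumes a shortened document with a single record and only three of the fields shown above; the object names (`records`, `fields`, `values`) are illustrative.

```r
library(xml2)

# shortened stand-in for the records document: one record, three fields
records <- read_xml("<records>
  <record>
    <dateRep>14/10/2020</dateRep>
    <cases>66</cases>
    <deaths>0</deaths>
  </record>
</records>")

# the children of a record are its fields
record <- xml_find_first(records, ".//record")
fields <- xml_children(record)

# tag names become variable names, tag contents become values
values <- setNames(xml_text(fields), xml_name(fields))
values
```

Note that `xml_text()` returns character strings, so numeric fields such as `cases` would still need a conversion with `as.numeric()`.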
There are two principal ways to link variable names to values.
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
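1. Tags: the variable name forms an opening and a closing tag that surround the value, as in <filename>ISCCPMonthly_avg.nc</filename>.
2. Attributes: the values are stored as attributes within a single tag, as in <case date="16-JAN-1994" temperature="9.200012" />.
Attributes-based: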
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Tag-based:
<cases>
  <case>
    <date>16-JAN-1994</date>
    <temperature>9.200012</temperature>
  </case>
  <case>
    <date>16-FEB-1994</date>
    <temperature>10.70001</temperature>
  </case>
  <case>
    <date>16-MAR-1994</date>
    <temperature>7.5</temperature>
  </case>
  <case>
    <date>16-APR-1994</date>
    <temperature>8.100006</temperature>
  </case>
</cases>
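The two conventions carry the same information but are read with different functions. The sketch below, using the xml2 package, assumes two shortened documents (two cases each); wrapping the attribute-based cases in a `<cases>` root node is an assumption made here so that the snippet parses as a complete document.

```r
library(xml2)

# attribute-based: values are stored as attributes of each <case> tag
attr_doc <- read_xml('<cases>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
</cases>')

# tag-based: values are stored as the content of child tags
tag_doc <- read_xml('<cases>
  <case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>
  <case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>
</cases>')

# attributes are read with xml_attr()
attr_cases <- xml_find_all(attr_doc, ".//case")
temps_attr <- as.numeric(xml_attr(attr_cases, "temperature"))

# tag contents are read with xml_text()
temps_tag <- as.numeric(xml_text(xml_find_all(tag_doc, ".//temperature")))

# both routes yield the same temperatures
identical(temps_attr, temps_tag)
```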
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
XML:
<person> <firstName>John</firstName> <lastName>Smith</lastName> </person>
JSON:
{"firstName": "John",
"lastName": "Smith"
}
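As a rough illustration of how the same record is accessed in both formats from R, the following sketch parses the two short snippets as inline strings with xml2 and jsonlite. The object names `xml_person` and `json_person` are chosen here for illustration.

```r
library(xml2)
library(jsonlite)

# the same person, once as XML and once as JSON
xml_person <- read_xml("<person> <firstName>John</firstName> <lastName>Smith</lastName> </person>")
json_person <- fromJSON('{"firstName": "John", "lastName": "Smith"}')

# XML: locate the node, then extract its text
xml_last <- xml_text(xml_find_first(xml_person, ".//lastName"))

# JSON: the parsed document is a named list
json_last <- json_person$lastName

xml_last == json_last
```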
HyperText Markup Language (HTML), designed to be read by a web browser.
HTML documents/webpages consist of ‘semi-structured data’:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>
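Since HTML is organised in nested tags much like XML, the xml2 package can also parse it. The sketch below assumes the small page above is supplied as an inline string (the name `html_source` is illustrative) and extracts the title and the heading.

```r
library(xml2)

# the example page from above as a string
html_source <- "<!DOCTYPE html>
<html>
  <head>
    <title>hello, world</title>
  </head>
  <body>
    <h2> hello, world </h2>
  </body>
</html>"

# parse the HTML document
html_doc <- read_html(html_source)

# extract the page title and the heading; trim = TRUE drops surrounding spaces
page_title <- xml_text(xml_find_first(html_doc, ".//title"))
heading <- xml_text(xml_find_first(html_doc, ".//h2"), trim = TRUE)
```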
In this example, we look at Wikipedia’s Economy of Switzerland page.
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("data/customers.xml")
xml_doc
## {xml_document}
## <customers>
## [1] <person>\n <name>John Doe</name>\n <orders>\n <product> x </product>\n <product> y </ ...
## [2] <person>\n <name>Peter Pan</name>\n <orders>\n <product> a </product>\n <product> x < ...
‘customers’ is the root node; the ‘person’ nodes are its children:
# navigate downwards
persons <- xml_children(xml_doc)
persons
## {xml_nodeset (2)}
## [1] <person>\n <name>John Doe</name>\n <orders>\n <product> x </product>\n <product> y </ ...
## [2] <person>\n <name>Peter Pan</name>\n <orders>\n <product> a </product>\n <product> x < ...
Navigate sidewards and upwards
# navigate sidewards
persons[1]
## {xml_nodeset (1)}
## [1] <person>\n <name>John Doe</name>\n <orders>\n <product> x </product>\n <product> y </ ...
xml_siblings(persons[[1]])
## {xml_nodeset (1)}
## [1] <person>\n <name>Peter Pan</name>\n <orders>\n <product> a </product>\n <product> x < ...
# navigate upwards
xml_parents(persons)
## {xml_nodeset (1)}
## [1] <customers>\n <person>\n <name>John Doe</name>\n <orders>\n <product> x </product ...
Extract specific parts of the data:
# find data via XPath
customer_names <- xml_find_all(xml_doc, xpath = ".//name")
# extract the data as text
xml_text(customer_names)
## [1] "John Doe" "Peter Pan"
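The extracted values can be collected into a rectangular data frame. The sketch below does not read data/customers.xml; it assumes a hypothetical, shortened stand-in document with two customers and two products each, and the column names `name` and `n_products` are chosen here for illustration.

```r
library(xml2)

# hypothetical stand-in for data/customers.xml (contents are an assumption)
xml_doc <- read_xml("<customers>
  <person>
    <name>John Doe</name>
    <orders><product> x </product><product> y </product></orders>
  </person>
  <person>
    <name>Peter Pan</name>
    <orders><product> a </product><product> x </product></orders>
  </person>
</customers>")

persons <- xml_children(xml_doc)

# count the products ordered by each person
n_products <- vapply(
  seq_along(persons),
  function(i) length(xml_find_all(persons[[i]], ".//product")),
  integer(1)
)

# one row per customer
customers <- data.frame(
  name = xml_text(xml_find_first(persons, "./name")),
  n_products = n_products,
  stringsAsFactors = FALSE
)
customers
```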
# load packages
library(jsonlite)
# parse the JSON-document shown in the example above
json_doc <- fromJSON("data/person.json")
# look at the structure of the document
str(json_doc)
## List of 6
##  $ firstName  : chr "John"
##  $ lastName   : chr "Smith"
##  $ age        : int 25
##  $ address    :List of 4
##   ..$ streetAddress: chr "21 2nd Street"
##   ..$ city         : chr "New York"
##   ..$ state        : chr "NY"
##   ..$ postalCode   : chr "10021"
##  $ phoneNumber:'data.frame': 2 obs. of 2 variables:
##   ..$ type  : chr [1:2] "home" "fax"
##   ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
##  $ gender     :List of 1
##   ..$ type: chr "male"
The nesting structure is represented as a nested list:
# navigate the nested lists, extract data
# extract the address part
json_doc$address
## $streetAddress
## [1] "21 2nd Street"
##
## $city
## [1] "New York"
##
## $state
## [1] "NY"
##
## $postalCode
## [1] "10021"
# extract the gender (type)
json_doc$gender$type
## [1] "male"
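The array of phone numbers is the one part of the document that jsonlite turns into a data frame rather than a list, so it can be filtered like any rectangular data. The following sketch assumes the JSON example from above is passed as an inline string instead of being read from data/person.json.

```r
library(jsonlite)

# the JSON example from above as a string
json_doc <- fromJSON('{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ],
  "gender": {"type": "male"}
}')

# the array of phone objects is simplified to a data frame
phones <- json_doc$phoneNumber

# select the fax number by filtering on the type column
fax_number <- phones[phones$type == "fax", "number"]
fax_number
```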
-> Exercise session next week